Summary statistics for fake (1) and real (0) news
total_dict_words: average number of dictionary words in an article
prop_dict_words: average proportion of an article's words that appear in a dictionary
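The two statistics above can be sketched for a single article as follows. This is a minimal Python illustration, not the code that produced the tables (which appear to come from R). The word list `TOY_DICTIONARY`, the tokenizing regex, and the function name `dict_word_stats` are all assumptions made for the example; the dictionary actually used is not stated in this write-up.

```python
import re

# Hypothetical stand-in for a real word list; the dictionary used in the
# original analysis is not stated.
TOY_DICTIONARY = {"the", "senate", "voted", "on", "bill", "today"}


def dict_word_stats(text, dictionary=TOY_DICTIONARY):
    """Return (total_dict_words, prop_dict_words) for one article."""
    # Lowercase and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0, 0.0
    in_dict = sum(1 for t in tokens if t in dictionary)
    return in_dict, in_dict / len(tokens)


# "todayy" is a deliberate misspelling, so it does not count as a dictionary word.
total, prop = dict_word_stats("The Senate voted on the bill todayy")
```

Averaging `total` and `prop` over all articles in a news type would give values comparable to `total_dict_words` and `prop_dict_words`.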
What about Clinton?
## # A tibble: 2 x 2
## fake tot
## <dbl> <int>
## 1 0 42
## 2 1 5
Clinton was mentioned 42 times in real news and only 5 times in fake news.
Main sources of fake and real news?
Top real news sources
## # A tibble: 10 x 2
## source n
## <chr> <int>
## 1 http://politi.co 43
## 2 http://cnn.it 22
## 3 http://abcn.ws 9
## 4 http://occupydemocrats.com 9
## 5 http://eaglerising.com 5
## 6 http://www.addictinginfo.org 5
## 7 http://author.addictinginfo.org 4
## 8 http://rightwingnews.com 4
## 9 http://conservativebyte.com 3
## 10 http://freedomdaily.com 3
Top fake news sources
## # A tibble: 10 x 2
## source n
## <chr> <int>
## 1 <NA> 29
## 2 http://uspoln.com 6
## 3 http://thelastlineofdefense.org 4
## 4 http://freedomcrossroads.us 3
## 5 http://departedmedia.com 2
## 6 http://newsfeedhunter.com 2
## 7 http://politicono.com 2
## 8 http://politicot.com 2
## 9 http://thenewyorkevening.com 2
## 10 http://undergroundnewsreport.com 2
How similar are the words used in fake and real news? The table below shows the word correlation between the two news types.
Pearson correlation of about 0.87
## 0 1
## 0 1.0000000 0.8658092
## 1 0.8658092 1.0000000
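One way such a correlation could be computed is to turn each news type into a vector of word proportions over a shared vocabulary and take the Pearson correlation of the two vectors. The sketch below is in Python and rests on assumptions: whitespace tokenization, a vocabulary that is the union of both groups with missing words counted as zero, and hypothetical function names. The original analysis may have made different choices.

```python
from collections import Counter
from math import sqrt


def word_proportions(docs):
    """Map each word to its share of all tokens in a list of documents."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def frequency_correlation(real_docs, fake_docs):
    """Correlate word proportions of two corpora over their joint vocabulary."""
    real = word_proportions(real_docs)
    fake = word_proportions(fake_docs)
    vocab = sorted(set(real) | set(fake))  # assumption: union, zeros for absent words
    return pearson([real.get(w, 0.0) for w in vocab],
                   [fake.get(w, 0.0) for w in vocab])
```

A correlation is a value between -1 and 1, so 0.8658 is better read as "about 0.87" than as a percentage.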
Visualizing word correlations
Pairwise correlation
Bigram correlation (tf-idf)
As measured by tf-idf, these are the most important bigrams in each news type, meaning they are the phrases that most distinguish fake news from real news articles.
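To make the tf-idf ranking concrete, here is a minimal Python sketch that treats each news type as one "document" and scores its bigrams. It assumes whitespace tokenization, term frequency as a share of the group's bigrams, and inverse document frequency as the natural log of (number of groups / number of groups containing the bigram). The function names and the toy input are hypothetical, and the original plots may have been built with different settings.

```python
from collections import Counter
from math import log


def bigrams(text):
    """Return adjacent word pairs of a text as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tf_idf_by_group(groups):
    """groups: dict mapping a group label to a list of article texts.

    Returns {label: {bigram: tf-idf score}}.
    """
    counts = {g: Counter(b for doc in docs for b in bigrams(doc))
              for g, docs in groups.items()}
    n_groups = len(counts)
    # Number of groups in which each bigram occurs at least once.
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for g, c in counts.items():
        total = sum(c.values())
        scores[g] = {t: (n / total) * log(n_groups / doc_freq[t])
                     for t, n in c.items()}
    return scores


# Toy input, made up for illustration only.
scores = tf_idf_by_group({
    "real": ["white house said"],
    "fake": ["white house lied"],
})
```

With only two groups, any bigram that appears in both gets a score of zero, which is why the top-ranked phrases are the ones unique to one news type.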
Unigram correlation (tf-idf)
This is the unigram version of the plot above. These particular words are the most important to each respective news type.
Trigram correlation (tf-idf)
Average AFINN sentiment score for real (0) and fake (1) news sources
## # A tibble: 2 x 2
## fake sum_sent
## <dbl> <dbl>
## 1 0 -646
## 2 1 -899
## # A tibble: 2 x 2
## fake sum_avg
## <dbl> <dbl>
## 1 0 -10.4
## 2 1 -21.0
## # A tibble: 2 x 2
## fake avg_sent
## <dbl> <dbl>
## 1 0 -0.223
## 2 1 -0.465
First table: sum of all sentiment values by news type.
Second table: sum of AFINN values per article, averaged by news type.
There is no strong consensus on how to compare aggregate sentiment scores. In many applications people use averages because documents can vary in length. “If you have a very long document you might see more positive or negative words, but this can simply be a function of having more words overall. If your document lengths are similar, then summing might make more sense” (Prof. Mitts).
This is a good example, though, of how the interpretation of sentiment analysis results varies with methodology.
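The effect of the aggregation choice can be seen in a small Python sketch. Everything here is illustrative: `TOY_LEXICON` contains made-up scores, not real AFINN values, and `mean_of_means` is a hypothetical name for one possible length-adjusted aggregate. `sum_sent` and `sum_avg` follow the descriptions of the first and second tables.

```python
# Made-up word scores for illustration; these are NOT the real AFINN values.
TOY_LEXICON = {"good": 3, "bad": -3, "crisis": -2, "win": 2}


def article_scores(text, lexicon=TOY_LEXICON):
    """Return the lexicon score of every scored word in one article."""
    return [lexicon[w] for w in text.lower().split() if w in lexicon]


def aggregate(articles, lexicon=TOY_LEXICON):
    """Aggregate the articles of one news type in three different ways."""
    if not articles:
        raise ValueError("need at least one article")
    per_article = [article_scores(a, lexicon) for a in articles]
    sums = [sum(s) for s in per_article]
    # Per-article mean score; articles with no scored words are skipped.
    means = [sum(s) / len(s) for s in per_article if s]
    return {
        "sum_sent": sum(sums),                # total over all scored words
        "sum_avg": sum(sums) / len(sums),     # per-article sums, averaged
        "mean_of_means": sum(means) / len(means) if means else 0.0,
    }


result = aggregate(["bad crisis", "bad bad good"])
```

Here the first toy article has the more negative sum (-5 against -3) and also the more negative mean (-2.5 against -1.0), but with longer articles the two measures can easily rank documents differently, which is the concern raised in the quote.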
For the three plots below:
Each line represents the average AFINN score by article.
Each line represents the sum of AFINN scores by article.
Each line represents the net Bing score by article (positive word count minus negative word count).
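The net score in the third plot is a simple count difference. A minimal Python sketch, assuming whitespace tokenization and using two small made-up word sets in place of the actual Bing lexicon:

```python
# Made-up word sets for illustration; not the actual Bing lexicon.
TOY_POSITIVE = {"win", "great", "support"}
TOY_NEGATIVE = {"scandal", "corrupt", "fail"}


def net_bing_score(text, positive=TOY_POSITIVE, negative=TOY_NEGATIVE):
    """Return (number of positive words) - (number of negative words)."""
    tokens = text.lower().split()
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return pos - neg


score = net_bing_score("great win despite scandal")
```

Unlike the AFINN scores, this measure gives every matched word the same weight, so it reflects only the balance of positive and negative words, not their intensity.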
Words that contributed to the wrong sentiment direction
## # A tibble: 2 x 3
## # Groups: fake [2]
## fake mean_word_sent_disgust mean_sent_prop_disgust
## <dbl> <dbl> <dbl>
## 1 1 5.80 0.028
## 2 0 7.01 0.0256
Unigram correlation (tf-idf)
These words are the most important ‘disgust’ unigrams to each respective news type.
## # A tibble: 2 x 3
## # Groups: fake [2]
## fake mean_word_sent_fear avg_prop_fear
## <dbl> <dbl> <dbl>
## 1 1 10.4 0.0502
## 2 0 17.0 0.0549
mean_word_sent_fear: average number of ‘fear’ words in an article
avg_prop_fear: average proportion of ‘fear’ words in an article
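The two columns defined above can be sketched in Python as follows. The word set `TOY_FEAR_WORDS` is made up for illustration and is not the NRC lexicon, and `emotion_stats` is a hypothetical helper; the same function would apply to any of the NRC categories reported here.

```python
# Made-up word set for illustration; not the actual NRC 'fear' list.
TOY_FEAR_WORDS = {"attack", "threat", "danger"}


def emotion_stats(articles, emotion_words):
    """Return (mean count, mean proportion) of emotion words per article.

    Articles with no tokens are skipped.
    """
    counts, props = [], []
    for text in articles:
        tokens = text.lower().split()
        if not tokens:
            continue
        n = sum(t in emotion_words for t in tokens)
        counts.append(n)
        props.append(n / len(tokens))
    if not counts:
        return 0.0, 0.0
    return sum(counts) / len(counts), sum(props) / len(props)


mean_count, mean_prop = emotion_stats(
    ["threat of attack", "no danger here now"], TOY_FEAR_WORDS
)
```

The count depends on article length while the proportion does not, so the two columns can point in different directions. In the ‘disgust’ table, for example, real news has the higher average count (7.01 against 5.80) but fake news has the higher average proportion (0.028 against 0.0256).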
Unigram correlation (tf-idf)
These words are the most important ‘fear’ unigrams to each respective news type.
## # A tibble: 2 x 3
## # Groups: fake [2]
## fake mean_word_sent_joy mean_sent_prop_joy
## <dbl> <dbl> <dbl>
## 1 1 6.70 0.0218
## 2 0 8.08 0.029
Unigram correlation (tf-idf)
These words are the most important ‘joy’ unigrams to each respective news type.
## # A tibble: 2 x 3
## # Groups: fake [2]
## fake mean_word_sent_negative mean_sent_prop_negative
## <dbl> <dbl> <dbl>
## 1 1 15.9 0.0771
## 2 0 24.5 0.08
Unigram correlation (tf-idf)
These words are the most important ‘negative’ unigrams to each respective news type.
## # A tibble: 2 x 3
## # Groups: fake [2]
## fake mean_word_sent_anger mean_sent_prop_anger
## <dbl> <dbl> <dbl>
## 1 1 7.67 0.0406
## 2 0 14.0 0.0561
Unigram correlation (tf-idf)
These words are the most important ‘anger’ unigrams to each respective news type.
## # A tibble: 2 x 3
## # Groups: fake [2]
## fake mean_word_sent_sadness mean_sent_prop_sadness
## <dbl> <dbl> <dbl>
## 1 1 7.74 0.0425
## 2 0 12.4 0.0432
Unigram correlation (tf-idf)
These words are the most important ‘sadness’ unigrams to each respective news type.
Summary table of NRC sentiments
Table includes:
average proportion of words in an article that carry sentiment X
NRC sentiments here: fear, sadness, disgust, anger, negative, joy